Bicycle Accidents in Great Britain (1979 to 2018)

Programming for Data Science | Mohammad Idris Attal

In this project I use a dataset of bicycle accidents in Great Britain from 1979 to 2018. For each accident, the dataset records the accident index, the number of vehicles involved, the number of casualties, the date and time, the speed limit, the road and weather conditions, the day of the week, and the type of road on which the accident took place.

The dataset supports analysis along several dimensions, such as the number of casualties, the number of vehicles, road conditions, and speed limit. This project focuses on the number of casualties per year. In the machine learning and model evaluation phase, linear regression, polynomial regression, and a random forest regressor are applied to predict the number of casualties for future years from the yearly totals. Several models are tried in order to find the one that fits best.

Formation of the Dataset

Accident_Index, Number_of_Vehicles, Number_of_Casualties, Date, Time, Speed_limit, Road_conditions, Weather_conditions, Day, Road_type, Light_conditions, Gender, Severity, Age_Grp

1. Importing Essential Libraries

In [1]:
#install these packages
#pip3 install pandas-profiling or pip install pandas-profiling
#pip3 install plotly
In [7]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import seaborn as sns 
plt.style.use('fivethirtyeight')  
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
from pandas_profiling import ProfileReport

2. Importing the Bicycle Accidents Dataset

In [8]:
accident = pd.read_csv('./sets/accidents.csv')
biker = pd.read_csv('./sets/bikers.csv')
In [9]:
accident.head()
Out[9]:
Accident_Index Number_of_Vehicles Number_of_Casualties Date Time Speed_limit Road_conditions Weather_conditions Day Road_type Light_conditions
0 197901A1SEE71 2 1 1979-01-01 18:20 50.0 Snow Unknown Monday Dual carriageway Darkness lights lit
1 197901A2JDW40 1 1 1979-02-01 09:15 30.0 Snow Unknown Tuesday Unknown Daylight
2 197901A4IJV90 2 1 1979-04-01 08:45 30.0 Snow Unknown Thursday Unknown Daylight
3 197901A4NIE33 2 1 1979-04-01 13:40 30.0 Wet Unknown Thursday Unknown Daylight
4 197901A4SKO47 2 1 1979-04-01 18:50 30.0 Wet Unknown Thursday Unknown Darkness lights lit
In [10]:
biker.head()
Out[10]:
Accident_Index Gender Severity Age_Grp
0 197901A1SEE71 Male Serious 36 to 45
1 197901A2JDW40 Male Slight 46 to 55
2 197901A4IJV90 Male Slight 46 to 55
3 197901A4NIE33 Male Slight 36 to 45
4 197901A4SKO47 Male Slight 46 to 55

Combining the Two CSV Files of the Dataset

The accident records and the biker records belong to the same dataset but are stored in separate files, so we merge bikers.csv into accidents.csv on Accident_Index to obtain the complete dataset.

In [11]:
df = accident.merge(biker, on='Accident_Index', how='left')
In [12]:
df.head()
Out[12]:
Accident_Index Number_of_Vehicles Number_of_Casualties Date Time Speed_limit Road_conditions Weather_conditions Day Road_type Light_conditions Gender Severity Age_Grp
0 197901A1SEE71 2 1 1979-01-01 18:20 50.0 Snow Unknown Monday Dual carriageway Darkness lights lit Male Serious 36 to 45
1 197901A2JDW40 1 1 1979-02-01 09:15 30.0 Snow Unknown Tuesday Unknown Daylight Male Slight 46 to 55
2 197901A4IJV90 2 1 1979-04-01 08:45 30.0 Snow Unknown Thursday Unknown Daylight Male Slight 46 to 55
3 197901A4NIE33 2 1 1979-04-01 13:40 30.0 Wet Unknown Thursday Unknown Daylight Male Slight 36 to 45
4 197901A4SKO47 2 1 1979-04-01 18:50 30.0 Wet Unknown Thursday Unknown Darkness lights lit Male Slight 46 to 55
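
The merge above can be sanity-checked before any analysis. The following is a minimal sketch, assuming two tiny hand-made frames standing in for accidents.csv and bikers.csv (the names `accident_demo` and `biker_demo` and their rows are illustrative, not real data). It uses the `validate` and `indicator` options of pandas' `merge` to confirm that keys are unique and to find accidents without a matching biker record.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (values are illustrative only)
accident_demo = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "Number_of_Casualties": [1, 2, 1],
})
biker_demo = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Severity": ["Serious", "Slight"],
})

# validate="one_to_one" raises an error if a key is duplicated on either side;
# indicator=True adds a "_merge" column saying where each row came from
merged_demo = accident_demo.merge(
    biker_demo, on="Accident_Index", how="left",
    validate="one_to_one", indicator=True,
)

# Rows marked "left_only" are accidents with no matching biker record
unmatched = merged_demo[merged_demo["_merge"] == "left_only"]
print(len(merged_demo), "rows,", len(unmatched), "without a biker record")
```

If the keys were not unique, a left merge would duplicate accident rows and inflate any later sum of Number_of_Casualties, which is why the uniqueness check is worth running once on the real files.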

3. Analysis of Our Dataset (Profile Report)

Pandas profiling is an open-source Python module that produces an exploratory data analysis report in just a few lines of code.

In [13]:
# for full explorative report please uncomment this part
# ProfileReport(df, title="Bicycle Accident", explorative=True)

# for minimal report
ProfileReport(df, title="Bicycle Accident", minimal=True)
Out[13]:

4. Data Manipulation: Number of Casualties per Year

For the prediction task defined above, we need the total Number_of_Casualties for each year.

In [14]:
# Extract the year (the part of the date string before the first '-') as an integer
df['Year'] = df['Date'].apply(lambda x: x.split('-')[0]).astype(int)

# Group by year and sum Number_of_Casualties to get one total per year
df_peryearcas = df.groupby('Year')['Number_of_Casualties'].sum().reset_index()
In [15]:
df_peryearcas.head()
Out[15]:
Year Number_of_Casualties
0 1979 24007
1 1980 25206
2 1981 25723
3 1982 28782
4 1983 31010
In [16]:
# Plot the yearly totals to get a better feel for the data
fig = px.scatter(df_peryearcas, x="Year", y="Number_of_Casualties")
fig.show()
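
The year extraction and yearly aggregation can be illustrated on a few rows. This is a hedged sketch: the three rows are copied from the `accident.head()` output shown earlier, and the `%Y-%d-%m` format is an assumption suggested by those rows (for example, 1979-02-01 is labelled Tuesday, which matches 2 January 1979 rather than 1 February). It should be verified on the full file before any month-level analysis.

```python
import pandas as pd

# Three rows copied from the accident.head() output shown earlier
demo = pd.DataFrame({
    "Date": ["1979-01-01", "1979-02-01", "1979-04-01"],
    "Day": ["Monday", "Tuesday", "Thursday"],
    "Number_of_Casualties": [1, 1, 1],
})

# Year as an integer: the first four characters of the date string
demo["Year"] = demo["Date"].str[:4].astype(int)
per_year = demo.groupby("Year", as_index=False)["Number_of_Casualties"].sum()

# Assumption: the strings are year-day-month; check against the Day column
parsed = pd.to_datetime(demo["Date"], format="%Y-%d-%m")
days_match = (parsed.dt.day_name() == demo["Day"]).all()

print(per_year)
print("Day column consistent with year-day-month:", days_match)
```

Taking the year from the first four characters is safe under either reading of the date, so the yearly totals do not depend on this assumption.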

5. Prediction Model Application

We apply three widely used regression models to the yearly totals and then compare how accurately each one predicts.

5.1 Linear Regression

In [18]:
from sklearn.model_selection import train_test_split # Function for randomly splitting a dataset
from sklearn import linear_model # Module containing linear regression models
from sklearn.metrics import mean_squared_error, r2_score # Functions used to evaluate models
In [19]:
# Partition the dataset into training and testing subsets
xTrain, xTest, yTrain, yTest = train_test_split(df_peryearcas['Year'], 
                                                df_peryearcas['Number_of_Casualties'], 
                                                test_size=0.33, # Use 33% of the samples for testing
                                                random_state=42) # Fixing the random state makes the function
                                                                 # draw the same random samples on every run,
                                                                 # so the experiment can be replicated
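
The effect of `test_size` and `random_state` can be checked in isolation. This is a minimal sketch, assuming one row per year from 1979 to 2018 (40 samples) and a placeholder target instead of the real casualty totals. It also shows a chronological split (`shuffle=False`), which is a stricter test when the goal is to forecast future years, because a random split lets the model interpolate between neighbouring years.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: one row per year from 1979 to 2018, i.e. 40 samples
years = np.arange(1979, 2019)
values = np.arange(40)  # placeholder target, NOT real casualty counts

# Two calls with the same random_state produce the same split
a_train, a_test, _, _ = train_test_split(years, values, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(years, values, test_size=0.33, random_state=42)
print("train/test sizes:", len(a_train), len(a_test))
print("same test years on both calls:", np.array_equal(a_test, b_test))

# Chronological alternative: the last years are held out, nothing is shuffled
c_train, c_test, _, _ = train_test_split(years, values, test_size=0.33, shuffle=False)
print("chronological test years start at:", c_test.min())
```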
5.1.1 Visual Representation of the Training and Testing Split

To better understand the split, we plot the training and testing samples.

In [20]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(x=xTrain, y=yTrain,
                    mode='markers',
                    name='Training Data'))
fig.add_trace(go.Scatter(x=xTest, y=yTest,
                    mode='markers',
                    name='Testing Data'))

fig.show()
5.1.2 Create Regression Model
In [21]:
# Import the function to create linear regression model from the sklearn library
from sklearn import linear_model 

# Create linear regression instance and assign it to the variable 'linModel' 
linModel = linear_model.LinearRegression()

# Train the model using the training dataset
linModel.fit(X=xTrain.values.reshape(-1, 1), y=yTrain.values.reshape(-1, 1))
Out[21]:
LinearRegression()
In [22]:
# Test the model
linPrediction = linModel.predict(xTest.values.reshape(-1, 1))
In [23]:
# Print test results 1 to 9 (index 0, the first test sample, is skipped)
print("actual value", "predicted value")
for actual, predicted in zip(yTest.values[1:10], linPrediction[1:10]):
    print(actual, "       ", predicted)
actual value predicted value
24512         [22784.56728232]
24405         [23101.08198044]
14823         [19619.4203011]
31010         [26582.74365979]
24434         [24050.62607481]
18311         [16137.75862176]
14749         [19302.90560298]
17358         [15504.72922552]
27087         [25949.71426354]
In [25]:
# Plot the fitted regression line over the yearly totals

plt.figure(figsize=(29, 9)) # Set the size before drawing so it applies to this figure
plt.scatter(df_peryearcas['Year'],df_peryearcas['Number_of_Casualties'],color='red')
plt.plot(df_peryearcas['Year'],linModel.predict(df_peryearcas['Year'].values.reshape(-1, 1)),color='blue')
plt.title('Linear Regression')
plt.xlabel('Year')
plt.ylabel('Number of casualties')
plt.show()
5.1.3 The Metrics for Evaluating Performance
In [26]:
mseLin = mean_squared_error(yTest.values, linPrediction)
print('The Mean Squared Error = {0:.3f}'.format(mseLin))
The Mean Squared Error = 6894820.703
5.1.4 The R² (pronounced r-squared)
In [27]:
r2_Lin = r2_score(yTest.values, linPrediction)
print('Goodness of Fit: {0:.3f}'.format(r2_Lin))
Goodness of Fit: 0.724
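
Both metrics can be reproduced by hand, which makes their meaning concrete: the mean squared error is the average of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses only the nine actual/predicted pairs printed above, so its numbers will differ from the scores reported for the full test set. For scale, the reported MSE of 6894820.703 corresponds to a root mean squared error of roughly 2626 casualties per year.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Nine actual/predicted pairs copied from the printed output above
y_true = np.array([24512, 24405, 14823, 31010, 24434,
                   18311, 14749, 17358, 27087], dtype=float)
y_pred = np.array([22784.56728232, 23101.08198044, 19619.4203011,
                   26582.74365979, 24050.62607481, 16137.75862176,
                   19302.90560298, 15504.72922552, 25949.71426354])

residuals = y_true - y_pred

# MSE = mean of squared residuals; RMSE is in the same unit as the target
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE  manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2   manual vs sklearn:", r2_manual, r2_score(y_true, y_pred))
print("RMSE:", rmse_manual)
```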
5.1.5 Checking Prediction Result
In [28]:
linModel.predict([[1985]])[0]
# linModel.predict([[2022]])[0]
Out[28]:
array([25949.71426354])

5.2 Polynomial Regression

In [29]:
# Partition the dataset into training and testing subsets
PxTrain, PxTest, PyTrain, PyTest = train_test_split(df_peryearcas['Year'], 
                                                df_peryearcas['Number_of_Casualties'],
                                                test_size=0.33, # Use 33% of the samples for testing
                                                random_state=42) # Fixing the random state makes the function
                                                                 # draw the same random samples on every run,
                                                                 # so the experiment can be replicated

5.2.1 Testing with Different Degrees

We test the model with different degrees to find the one that fits best.

In [30]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# All years and totals. Note that the scores printed below are computed on the
# full dataset (training and testing years together), not on the test split alone
allYears = df_peryearcas['Year'].values.reshape(-1, 1)
allTotals = df_peryearcas['Number_of_Casualties'].values

for i in range(1,5):
    
    poly=PolynomialFeatures(degree=i)
    X_poly=poly.fit_transform(PxTrain.values.reshape(-1, 1))
    lin_reg=LinearRegression()
    lin_reg.fit(X_poly,PyTrain)
    
    # Reuse the fitted transformer (transform, not fit_transform) for prediction
    fitted=lin_reg.predict(poly.transform(allYears))
    
    plt.figure(figsize=(29, 9)) # Set the size before drawing so it applies to this figure
    plt.scatter(df_peryearcas['Year'],allTotals,color='red')
    plt.plot(df_peryearcas['Year'],fitted,color='blue')
    plt.title('Polynomial Regression with degree='+ str(i))
    plt.xlabel('Year')
    plt.ylabel('Number of casualties')
    plt.show()
    
    print("R2 score : %.2f" % r2_score(allTotals,fitted))
    print("Mean squared error: %.2f" % mean_squared_error(allTotals,fitted))
R2 score : 0.70
Mean squared error: 6730205.54
R2 score : 0.71
Mean squared error: 6402172.30
R2 score : 0.86
Mean squared error: 3138045.75
R2 score : 0.86
Mean squared error: 3148710.45
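
The scores above include the years used for fitting, so they are not directly comparable with the test-set scores of the linear model. The following is a hedged sketch of scoring each degree on held-out years only. It uses a synthetic series rather than the real totals, and it shifts the year axis to start at zero (an added choice, not part of the original notebook), because raw years raised to the fourth power are on the order of 10^13 and can make the least-squares fit poorly conditioned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the yearly totals (NOT the real data)
rng = np.random.default_rng(0)
years = np.arange(1979, 2019)
t = years - years.min()  # years since 1979, keeps the powers small
totals = 25000 + 300 * t - 20 * t**2 + rng.normal(0, 500, size=t.size)

X = t.reshape(-1, 1).astype(float)
X_train, X_test, y_train, y_test = train_test_split(
    X, totals, test_size=0.33, random_state=42)

# Fit each degree on the training years, score on the held-out years only
test_r2 = {}
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    test_r2[degree] = r2_score(y_test, model.predict(X_test))

for degree, score in test_r2.items():
    print(f"degree={degree}: test R2 = {score:.3f}")
```

Wrapping the transformer and the regressor in one pipeline also removes the risk of fitting the model on one set of polynomial features and predicting with another.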
5.2.2 Choosing the Degree for Our Model
In [31]:
# Degrees 3 and 4 give the best scores above (R2 of 0.86), so we train the model with degree 4
poly_reg4=PolynomialFeatures(degree=4)
X_poly4=poly_reg4.fit_transform(PxTrain.values.reshape(-1, 1))
lin_reg_4=LinearRegression()
lin_reg_4.fit(X_poly4,PyTrain) # Fit on the degree-4 features built above
Out[31]:
LinearRegression()
In [32]:
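# Predictions for every year in df_peryearcas, in chronological order (1979 onward)
polyPrediction = lin_reg_4.predict(poly_reg4.transform(df_peryearcas['Year'].values.reshape(-1, 1)))
In [33]:
# Print entries 1 to 9 of both arrays. Note that the two columns are not aligned:
# PyTest holds the shuffled test years, while polyPrediction[i] is the fitted
# value for year 1979 + i. Predicting on PxTest instead would pair them correctly.
print("actual_value","predicted_value")
for actual, predicted in zip(PyTest.values[1:10], polyPrediction[1:10]):
    print(actual, "       ", predicted)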
actual_value predicted_value
24512         25779.807278633118
24405         26533.32789373398
14823         27113.79350376129
31010         27531.701882839203
24434         27797.566972732544
18311         27921.918841362
14749         27915.303717136383
17358         27788.283991336823
27087         27551.4381775856
5.2.3 The Metrics for Evaluating Performance
In [34]:
# Computed on the full dataset (all years), unlike the test-set MSE in section 5.1.3
msePoly = mean_squared_error(df_peryearcas['Number_of_Casualties'].values.reshape(-1, 1),lin_reg_4.predict(poly_reg4.transform(df_peryearcas['Year'].values.reshape(-1, 1))))
print('The Mean Squared Error = {0:.3f}'.format(msePoly))
The Mean Squared Error = 3148710.449
5.2.4 The R² (pronounced r-squared)
In [35]:
# R² is the share of the variance in the target explained by the model (full dataset)
r2_Poly = r2_score(df_peryearcas['Number_of_Casualties'].values.reshape(-1, 1),lin_reg_4.predict(poly_reg4.transform(df_peryearcas['Year'].values.reshape(-1, 1))))
print('Goodness of Fit: {0:.2f}'.format(r2_Poly))
Goodness of Fit: 0.86
5.2.5 Checking Prediction Result
In [36]:
lin_reg_4.predict(poly_reg4.transform([[1985]]))
Out[36]:
array([27921.91884136])

5.3 Random Forest Regressor

In [37]:
df_peryearcas.head()
Out[37]:
Year Number_of_Casualties
0 1979 24007
1 1980 25206
2 1981 25723
3 1982 28782
4 1983 31010
In [38]:
# Partition the dataset into training and testing subsets
RxTrain, RxTest, RyTrain, RyTest = train_test_split(df_peryearcas['Year'], 
                                                df_peryearcas['Number_of_Casualties'],
                                                test_size=0.33, # Use 33% of the samples for testing
                                                random_state=42) # Fixing the random state makes the function
                                                                 # draw the same random samples on every run,
                                                                 # so the experiment can be replicated
In [39]:
# Import the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# No random_state is set, so the results vary slightly from run to run
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(RxTrain.values.reshape(-1, 1), RyTrain.values) # y is passed as a 1-D array, as scikit-learn expects
predictions = random_forest.predict(RxTest.values.reshape(-1, 1))
In [40]:
# Print test results 1 to 9 (index 0, the first test sample, is skipped)
print("actual_value","predicted_value")
for actual, predicted in zip(RyTest.values[1:10], predictions[1:10]):
    print(actual, "       ", predicted)
actual_value predicted_value
24512         23981.59
24405         23993.21
14823         16049.21
31010         29206.3
24434         26765.62
18311         18646.73
14749         15065.69
17358         18461.36
27087         29784.61

5.3.1 The Metrics for Evaluating Performance

In [41]:
mseRF = mean_squared_error(RyTest.values, predictions)
print('The Mean Squared Error = {0:.3f}'.format(mseRF))
The Mean Squared Error = 1634671.398

5.3.2 The R² (pronounced r-squared)

In [42]:
# R² is the share of the variance in the test targets explained by the model
r2_RF = r2_score(RyTest.values, predictions)
print('Goodness of Fit: {0:.3f}'.format(r2_RF))
Goodness of Fit: 0.934

5.3.3 Checking Prediction Result

In [43]:
random_forest.predict([[1985]])[0]
Out[43]:
29784.61
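
When a random forest is used to predict years beyond the training range, one property of tree-based models matters: they predict averages of training targets, so they cannot extend a trend past the last year they have seen. The sketch below illustrates this on a synthetic, steadily falling series (not the real data). Every year after the last training year lands in the same leaves, so the forecast is flat.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, steadily falling series (NOT the real casualty totals)
years = np.arange(1979, 2019).reshape(-1, 1)
totals = 30000.0 - 400.0 * (years.ravel() - 1979)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(years, totals)

# All three future years lie beyond every split threshold, so each tree
# returns the same leaf value for them and the trend is not continued
future = forest.predict(np.array([[2019], [2030], [2050]]))
print(future)
```

A high test-set R² under a random split therefore mainly reflects how well the forest interpolates between neighbouring years. For forecasts of future years, the linear and polynomial models at least extend their fitted curve, although that extension carries its own uncertainty.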

6. Conclusion

Working with this dataset gave us the skills and knowledge needed to manipulate the data and train models on it in order to predict the number of bicycle accident casualties per year. This experiment also gives me the knowledge and confidence to explore the dataset further, for example by involving more of its columns and by applying other algorithms to improve prediction accuracy.
